Abstract: This paper establishes the equivalence between 
cognitive medium access and the competitive multi-armed bandit 
problem. First, the scenario in which a single cognitive user 
wishes to opportunistically exploit the availability of empty fre- 
quency bands in the spectrum with multiple bands is considered. 
In this scenario, the availability probability of each channel is 
unknown to the cognitive user a priori. Hence efficient medium 
access strategies must strike a balance between exploring the 
availability of other free channels and exploiting the opportunities 
identified thus far. By adopting a Bayesian approach for this 
classical bandit problem, the optimal medium access strategy is 
derived and its underlying recursive structure is illustrated via 
examples. To avoid the prohibitive computational complexity of 
the optimal strategy, a low complexity asymptotically optimal 
strategy is developed. The proposed strategy does not require 
any prior statistical knowledge about the traffic pattern on the 
different channels. Next, the multi-cognitive user scenario is 
considered and low complexity medium access protocols, which 
strike the optimal balance between exploration and exploitation 
in such competitive environments, are developed. Finally, this 
formalism is extended to the case in which each cognitive user is 
capable of sensing and using multiple channels simultaneously. 

I. Introduction 

Recently, the opportunistic spectrum access problem has 
been the focus of significant research activity [1]-[3]. The 
underlying idea is to allow unlicensed users (i.e., cognitive 
users) to access the available spectrum when the licensed 
users (i.e., primary users) are not active. The presence of high 
priority primary users and the requirement that the cognitive 
users should not interfere with them define a new medium 
access paradigm which we refer to as cognitive medium access. 
The overarching goal of our work is to develop a unified 
framework for the design of efficient, and low complexity, 
cognitive medium access protocols. 

The spectral opportunities available to the cognitive users 
are expected to be time-varying on different time-scales. For 
example, on a small scale, multimedia data traffic of the 
primary users will tend to be bursty [4]. On a large scale, one 
would expect the activities of each user to vary throughout the 
day. Therefore, to avoid interfering with the primary network, 
the cognitive users must first probe to determine whether there 
are primary activities in each channel before transmission. 
Under the assumption that each cognitive user cannot access 
all of the available channels simultaneously, the main task of 
the medium access protocol is to distributively choose which 
channels each cognitive user should attempt to use in different 
time slots, in order to fully (or maximally) utilize the spectral 
opportunities. This decision process can be enhanced by taking 
into account any available statistical information about the 
primary traffic. For example, with a single cognitive user 
capable of accessing (sensing) only one channel at a time, the 
problem becomes trivial if the probability that each channel 
is free is known a priori. In this case, the optimal rule is 
for the cognitive user to access the channel with the highest 
probability of being free in all time slots. However, such time- 
varying traffic information is typically not available to the 
cognitive users a priori. The need to learn this information 
on-line creates a fundamental tradeoff between exploitation 
and exploration. Exploitation refers to the short-term gain 
resulting from accessing the channel with the estimated highest 
probability of being free (based on the results of previous 
sensing decisions) whereas exploration is the process by 
which the cognitive user learns the statistical behavior of the 
primary traffic (by choosing possibly different channels to 
probe across time slots). In the presence of multiple cognitive 
users, the medium access algorithm must also account for the 
competition between different users over the same channel. 

In this paper, we develop a unified framework for the design 
and analysis of cognitive medium access protocols. As argued 
in the sequel, this framework allows for the construction of 
strategies that strike an optimal balance among exploration, 
exploitation and competition. The key observation motivat- 
ing our approach is the equivalence between our problem 
and the classical multi-armed bandit problem (see [5] and 
references therein). This equivalence allows for building a 
solid foundation for cognitive medium access using tools from 
reinforcement machine learning [6]. The connection between 
cognitive medium access and the multi-armed bandit problem 
has been independently and concurrently observed in [7]. That 
work, however, is limited to special cases of the general 
approach presented here. In particular, in [7], the channels are 
assumed to be independent and the goal is to maximize the 
discounted sum of throughput, which is the problem addressed 
in Example 4 in Section III below. A related work also appears 
in [8], in which the availability of each channel is assumed to 
follow a Markov chain, whose transition matrix is known to 
the cognitive user. The only uncertainty faced by the cognitive 
user in that work is the particular realization of the channel, 
while in our work the cognitive users also need to learn the 
statistics of the channel in real time. 

We consider three scenarios in this paper. In the first 
scenario, we assume the existence of a single cognitive user 
capable of accessing only a single channel at any given 
time. In this setting, we derive an optimal sensing rule that 
maximizes the expected throughput obtained by the cognitive 
user. Compared with a genie-aided scheme, in which the 
cognitive user knows a priori the primary network traffic 
information, there is a throughput loss suffered by any medium 
access strategy. We obtain a lower bound on this loss and 
further construct a linear complexity single index protocol that 
achieves this lower bound asymptotically (when the primary 
traffic behavior changes very slowly). In the second scenario, 
we design distributed sensing rules that account for the com- 
petitive dimension of the problem in which the cognitive users 
must also take the competition from other cognitive users 
into consideration when making sensing decisions. We first 
characterize the optimal distributed sensing rule for the case in 
which the traffic information of the primary network is avail- 
able to the cognitive users. Under this idealistic assumption, 
we show that the throughput loss of the proposed distributed 
sensing rule, compared with a throughput optimal centralized 
scheme, goes to zero exponentially as the number of cognitive 
users increases. To prevent any possible misbehavior by the 
cognitive users, we further design a game theoretically fair 
sensing rule, whose loss compared with the throughput optimal 
centralized rule also goes to zero exponentially. Building on 
these results, we then devise distributed sensing rules that do 
not require prior knowledge about the traffic and converge to 
the optimal distributed rule and game theoretically fair rule, 
respectively. In the third scenario, we extend our work to the 
case in which the cognitive user is capable of accessing more 
than one channel simultaneously. 

The rest of the paper is organized as follows. Our net- 
work model is detailed in Section II. Section III analyzes 
the scenario in which a single cognitive user capable of 
sensing one channel at a time is present. The extension to the 
multi-user case is reported in Section IV whereas the multi- 
channel extension is studied in Section V. Finally, Section VI 
summarizes our conclusions. 

II. Network Model 

Throughout this paper, upper-case letters (e.g., $X$) denote 
random variables, lower-case letters (e.g., $x$) denote realizations 
of the corresponding random variables, and calligraphic 
letters (e.g., $\mathcal{X}$) denote finite alphabet sets over which the 
corresponding variables range. Also, upper-case boldface letters 
(e.g., $\mathbf{X}$) denote random vectors and lower-case boldface letters 
(e.g., $\mathbf{x}$) denote realizations of the corresponding random 
vectors. 

Figure 1 shows the channel model of interest. We consider a 
primary network consisting of N channels, $\mathcal{N} = \{1, \cdots, N\}$, 
each with bandwidth B. The users in the primary network are 
operated in a synchronous time-slotted fashion. We use i to 
refer to the channel index, j to refer to the time-slot index, 
and k to refer to the index of the cognitive users. We assume 
that at each time slot, channel i is free with probability $\theta_i$. Let 
$Z_i(j)$ be a random variable that equals 1 if channel i is free at 
time slot j and equals 0 otherwise. Hence, given $\theta_i$, $Z_i(j)$ is 
a Bernoulli random variable with probability density function 
(pdf) 

$$h_{\theta_i}(z_i(j)) = \theta_i \, \delta(z_i(j) - 1) + (1 - \theta_i) \, \delta(z_i(j)),$$

where $\delta(\cdot)$ is the delta function. Furthermore, for a given 
$\boldsymbol{\theta} = (\theta_1, \cdots, \theta_N)$, the $Z_i(j)$ are independent for each i and j. We 
consider a block varying model in which the value of $\boldsymbol{\theta}$ is 
fixed for a block of T time slots and randomly changes at the 
beginning of the next block according to some joint pdf $f(\boldsymbol{\theta})$. 
Our results can also be extended to the scenarios in which 
the $Z_i(j)$ follow a Markov chain model. 

Fig. 1. Channel model. 

In our model, the cognitive users attempt to exploit the 
availability of free channels in the primary network by sensing 
the activity at the beginning of each time slot. Our work seeks 
to characterize efficient strategies for choosing which channels 
to sense (access). The challenge here stems from the fact that 
the cognitive users are assumed to be unaware of $\boldsymbol{\theta}$ a priori. 
We consider two cases in which the cognitive user either has or 
does not have prior information about the pdf of $\boldsymbol{\theta}$, i.e., $f(\boldsymbol{\theta})$. 
To further illustrate the point, let us consider our first scenario 
in which a single cognitive user capable of sensing only one 
channel is present. At time slot j, the cognitive user selects 
one channel $S(j) \in \mathcal{N}$ to access. If the sensing result shows 
that channel $S(j)$ is free, i.e., $Z_{S(j)}(j) = 1$, the cognitive user 
can send B bits over this channel; otherwise, the cognitive user 
will wait until the next time slot and pick a possibly different 
channel to access (throughout the paper, it is assumed that the 
outcome of the sensing algorithm is error free). Therefore, the 
total number of bits that the cognitive user is able to send over 
one block (of T time slots) is 



$$W = \sum_{j=1}^{T} B Z_{S(j)}(j).$$



It is now clear that W is a random variable that depends 
on the traffic in the primary network and, more importantly 
for us, on the medium access protocols employed by the 
cognitive user. Therefore, the overarching goal of Section III 
is to construct low complexity medium access protocols that 
maximize 



$$E\{W\} = E\left\{ \sum_{j=1}^{T} B Z_{S(j)}(j) \right\}. \tag{1}$$



Intuitively, the cognitive user would like to select the 
channel with the highest probability of being free in order 
to obtain more transmission opportunities. If $\boldsymbol{\theta}$ is known then 
this problem is trivial: the cognitive user should choose the 
channel $i^* = \arg\max_{i \in \mathcal{N}} \theta_i$ to sense. The uncertainty in $\boldsymbol{\theta}$ 
imposes a fundamental tradeoff between exploration, in order 
to learn $\boldsymbol{\theta}$, and exploitation, by accessing the channel with the 
highest estimated free probability based on current available 
information, as detailed in the following sections. 
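
The network model above can be illustrated with a small simulation. The following is a minimal sketch, assuming a fixed $\boldsymbol{\theta}$ within one block and error-free sensing as stated in the text; the function names `simulate_block` and `genie_choice` and the numerical values are hypothetical and chosen only for illustration.

```python
import random

def simulate_block(theta, T, B, choose, seed=0):
    """Simulate one block of T slots and return the bits sent, W.

    theta  : list of per-channel free probabilities (fixed over the block)
    choose : function mapping the slot index j (1-based) to a channel (0-based)
    """
    rng = random.Random(seed)
    bits = 0
    for j in range(1, T + 1):
        # Z_i(j) = 1 if channel i is free in slot j, independently across i and j.
        z = [1 if rng.random() < th else 0 for th in theta]
        s = choose(j)          # channel sensed in slot j
        bits += B * z[s]       # B bits are sent only if the sensed channel is free
    return bits

def genie_choice(theta):
    """Genie-aided rule: always sense the channel with the largest theta_i."""
    best = max(range(len(theta)), key=lambda i: theta[i])
    return lambda j: best

# Hypothetical parameters for illustration only.
theta = [0.1, 0.8, 0.5]
w = simulate_block(theta, T=1000, B=100, choose=genie_choice(theta))
```

Any strategy that lacks knowledge of $\boldsymbol{\theta}$ can be plugged in through `choose`; a full strategy would also need the past sensing outcomes, which this sketch omits for brevity.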

III. Single User-Single Channel 

We start by developing the optimal solution to the single 
user-single channel scenario under the idealized assumption 
that $f(\boldsymbol{\theta})$ is known a priori by the cognitive user. As argued 
next, the optimal medium access algorithm suffers from a 
prohibitive computational complexity that grows exponentially 
with the block length T. This motivates the design of low 
complexity asymptotically optimal approaches, which are considered 
afterwards. Interestingly, the proposed low complexity technique 
does not require prior knowledge about $f(\boldsymbol{\theta})$. 

A. Bayesian Approach 

Our single user-single channel cognitive medium access 
problem belongs to the class of bandit problems. In this 
setting, the decision maker must sequentially choose one 
process to observe from $N \geq 2$ stochastic processes. These 
processes usually have parameters that are unknown to the 
decision maker and, associated with each observation is a 
utility function. The objective of the decision maker is to 
maximize the sum or discounted sum of the utilities via a 
strategy that specifies which process to observe for every 
possible history of selections and observations. The following 
classical example illustrates the challenge facing our decision 
maker: A gambler enters a casino having N slot machines, the 
ith of which has winning probability $\theta_i$, $i \in \mathcal{N}$. The gambler 
does not know the values of the $\theta_i$ and must sequentially 
choose machines to play. The goal is to maximize the overall 
gain for a total of T plays. In this example, the stochastic 
processes are the outcomes of the slot machines, the utility 
function is the reward that the gambler gains each time and 
the gambling strategy specifies which machine to play based 
on each possible past information pattern. A comprehensive 
treatment covering different variants of bandit problems can 
be found in [5]. 

We are now ready to rigorously formulate our problem. The 
cognitive user employs a medium access strategy $\Gamma$, which 
will select channel $S(j) \in \mathcal{N}$ to sense at time slot j for 
any possible causal information pattern obtained through the 
previous j - 1 observations: 

$$\Psi(j) = \{ s(1), z_{s(1)}(1), \cdots, s(j-1), z_{s(j-1)}(j-1) \}, \quad j \geq 2,$$

i.e., $s(j) = \Gamma(f, \Psi(j))$. Notice that $z_{s(j)}(j)$ is the sensing 
outcome of the jth time slot, in which $s(j)$ is the channel 
being accessed. If j = 1, there is no accumulated information, 
thus $\Psi(1) = \emptyset$ and $s(1) = \Gamma(f)$. $\Gamma$ could be stochastic, i.e., for 
certain $\Psi(j)$ the cognitive user may randomly pick channel i 
from a set $\mathcal{A} \subseteq \mathcal{N}$ with probability $p_i$, such that $\sum_{i \in \mathcal{A}} p_i = 1$. 

The utility that the cognitive user obtains by making decision 
$S(j)$ at time slot j is the number of bits it can transmit at time 
slot j, which is $B Z_{S(j)}(j)$. We denote the expected value of 
the payoff obtained by a cognitive user who uses strategy $\Gamma$ 
as 

$$W_{\Gamma} = E_f\left\{ \sum_{j=1}^{T} B Z_{S(j)}(j) \right\}. \tag{2}$$

We denote $V^*(f, T) = \sup_{\Gamma} W_{\Gamma}$, which is the largest 
throughput that the cognitive user could obtain when the 
spectral opportunities are governed by $f(\boldsymbol{\theta})$ and the exact 
value of each realization of $\boldsymbol{\theta}$ is not known by the user. 

Each medium access decision made by the cognitive user 
has two effects. The first one is the short term gain, i.e., an 
immediate transmission opportunity if the chosen channel is 
found free. The second one is the long term gain, i.e., the 
updated statistical information about $f(\boldsymbol{\theta})$. This information 
will help the cognitive user in making better decisions in 
the future stages. There is an interesting tradeoff between the 
short and long term gains. If we only want to maximize the 
short term gain, we can pick the channel with the highest free 
probability to sense, based on the current information. This 
myopic strategy maximally exploits the existing information. 
On the other hand, by picking other channels to sense, we 
gain valuable statistical information about $f(\boldsymbol{\theta})$ which can 
effectively guide future decisions. This process is typically 
referred to as exploration. 

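More specifically, let $f^j(\boldsymbol{\theta})$ be the updated pdf after making 
j - 1 observations. We begin with $f^1(\boldsymbol{\theta}) = f(\boldsymbol{\theta})$. After 
observing $z_{s(j)}(j)$, we update the pdf using the following 
Bayesian formula. 

1) If $z_{s(j)}(j) = 1$: 

$$f^{j+1}(\boldsymbol{\theta}) = \frac{\theta_{s(j)} f^j(\boldsymbol{\theta})}{\int \theta_{s(j)} f^j(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}. \tag{3}$$

2) If $z_{s(j)}(j) = 0$: 

$$f^{j+1}(\boldsymbol{\theta}) = \frac{(1 - \theta_{s(j)}) f^j(\boldsymbol{\theta})}{\int (1 - \theta_{s(j)}) f^j(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}. \tag{4}$$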
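
To make the update rule concrete, the following is a minimal sketch of (3) and (4) for the special case in which the prior is a finite mixture of point masses, so that the integrals reduce to sums. The function name `bayes_update` and the two-point prior used below are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as F

def bayes_update(prior, s, z):
    """Posterior over theta after sensing channel s (0-based) and observing z.

    prior : dict mapping a tuple (theta_1, ..., theta_N) to its probability mass
    z     : 1 if the sensed channel was free, 0 if it was busy
    """
    # Numerator of (3) when z = 1, or of (4) when z = 0, at every support point.
    post = {th: p * (th[s] if z == 1 else 1 - th[s]) for th, p in prior.items()}
    total = sum(post.values())  # the normalizing integral, here a finite sum
    if total == 0:
        raise ValueError("observation has zero probability under the prior")
    return {th: p / total for th, p in post.items() if p > 0}

# Hypothetical two-point prior over two channels, for illustration only.
prior = {(F(1, 4), F(3, 4)): F(1, 2), (F(3, 4), F(1, 4)): F(1, 2)}
post = bayes_update(prior, s=0, z=1)
```

Exact rational arithmetic is used here only so that the posterior masses can be compared against hand calculations; floating point would work equally well.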



Now, Lemma 2.3.1 of [5] proves that every bandit problem with 
finite horizon has an optimal solution. Applying this result to 
our set-up, we obtain the following. 

Lemma 1: For any prior pdf $f$, there exists an optimal 
strategy $\Gamma^*$ to the channel selection problem (2), and $V^*(f, T)$ 
is achievable. Moreover, $V^*$ satisfies the following condition: 

$$V^*(f, T) = \max_{s(1) \in \mathcal{N}} E_f\left\{ B Z_{s(1)} + V^*(f_{Z_{s(1)}}, T - 1) \right\}, \tag{5}$$

where $f_{Z_{s(1)}}$ is the conditional pdf updated using (3) and (4) 
as if the cognitive user chooses $s(1)$ and observes $Z_{s(1)}$. Also, 
$V^*(f_{Z_{s(1)}}, T - 1)$ is the value of a bandit problem with prior 
information $f_{Z_{s(1)}}$ and T - 1 sequential observations. □ 
In principle, Lemma 1 provides the solution to problem (2). 
Effectively, it decouples the calculation at each stage, and 
hence, allows the use of dynamic programming to solve the 
problem. The idea is to solve the channel selection problem 
with a smaller dimension first and then use backward deduc- 
tion to obtain the optimal solution for a problem with a larger 
dimension. Starting with T = 1, the second term inside the 
expectation in (5) is 0, since T - 1 = 0. Hence, the optimal 
solution is to choose the channel i with the largest $E_f\{B Z_i\}$, 
which can be calculated as 

$$E_f\{B Z_i\} = B \int \theta_i f(\boldsymbol{\theta}) \, d\boldsymbol{\theta},$$

and $V^*(f, 1) = \max_{i \in \mathcal{N}} E_f\{B Z_i\}$. 

With the solution for T = 1 at hand, we can now solve the 
T = 2 case using (5). At first, for every possible choice of 
$s(1)$ and possible observation $z_{s(1)}$, we calculate the updated 
pdf $f_{z_{s(1)}}$ using (3) and (4). Next, we calculate $V^*(f_{z_{s(1)}}, 1)$ 
(which is equivalent to the T = 1 problem described above). 
Finally, applying (5), we have the following equation for the 
channel selection problem with T = 2: 

$$V^*(f, 2) = \max_{i \in \mathcal{N}} \int \left[ B\theta_i + \theta_i V^*(f_{z_i = 1}, 1) + (1 - \theta_i) V^*(f_{z_i = 0}, 1) \right] f(\boldsymbol{\theta}) \, d\boldsymbol{\theta}.$$

Correspondingly, the optimal solution is $\Gamma^*(f) = i^*(1)$, where 
$i^*(1)$ is the channel index attaining the maximum in the expression 
for $V^*(f, 2)$, i.e., in the first step, the cognitive user should 
choose channel $i^*(1)$ to sense. After observing $z_{i^*(1)}$, the 
cognitive user has $\Psi(2) = \{i^*(1), z_{i^*(1)}\}$, and it should choose 
$i^*(2) = \arg\max_{i \in \mathcal{N}} E_{f_{z_{i^*(1)}}}\{B Z_i\}$, i.e., the channel attaining 
$V^*(f_{z_{i^*(1)}}, 1)$, implying that $\Gamma^*(f, \Psi(2)) = i^*(2)$. 

Similarly, after solving the T = 2 problem, one can proceed 
to solve the T = 3 case. Using this procedure recursively, we 
can solve the problem with T - 1 observations. Finally, our 
original problem with T observations is solved as follows. 

$$V^*(f, T) = \max_{i \in \mathcal{N}} \int \left[ B\theta_i + \theta_i V^*(f_{z_i = 1}, T - 1) + (1 - \theta_i) V^*(f_{z_i = 0}, T - 1) \right] f(\boldsymbol{\theta}) \, d\boldsymbol{\theta}.$$

Example 1: Suppose we have two channels and two observations 
per block, i.e., $\mathcal{N} = \{1, 2\}$ and T = 2. The channels 
are known to be either both very busy or both relatively idle, 
which is reflected in the following joint pdf 

$$f(\theta_1, \theta_2) = \frac{4}{5} \delta(0.1, 0) + \frac{1}{5} \delta(0.8, 1),$$

where $\delta(x, y)$ is the delta function at point $(x, y)$. For simplicity 
of presentation, we assume that B = 100. 

In this example, on the average, channel 1 is available with 
probability 4/5 x 0.1 + 1/5 x 0.8 = 0.24, whereas channel 2 is 
available with probability 4/5 x 0 + 1/5 x 1 = 0.2. Hence, if the 
cognitive user ignores the information gained from sensing, 
it should always choose channel 1 to sense, resulting in an 
average throughput of 2 x 0.24 x 100 = 48 bits per block. Now, 
we use the procedure described above to derive the optimal 
rule and corresponding throughput. 

1) First calculate all possible updated pdfs after one step. 
If $s(1) = 1$, $z_{s(1)} = 1$, we have 

$$P(\theta_1 = 0.1, \theta_2 = 0 \mid z_{s(1)} = 1) = \frac{P(z_1 = 1 \mid \theta_1 = 0.1, \theta_2 = 0) \, P(\theta_1 = 0.1, \theta_2 = 0)}{P(z_1 = 1)} = \frac{0.1 \times 0.8}{0.8 \times 0.1 + 0.2 \times 0.8} = \frac{1}{3}.$$

Hence, for this case, we have 

$$f_{\{s(1)=1, z_{s(1)}=1\}} = \frac{1}{3} \delta(0.1, 0) + \frac{2}{3} \delta(0.8, 1).$$

Similarly, we obtain the following updated pdfs: 

$$f_{\{s(1)=1, z_{s(1)}=0\}} = \frac{18}{19} \delta(0.1, 0) + \frac{1}{19} \delta(0.8, 1),$$

$$f_{\{s(1)=2, z_{s(1)}=1\}} = \delta(0.8, 1),$$

$$f_{\{s(1)=2, z_{s(1)}=0\}} = \delta(0.1, 0).$$

2) With the updated distribution information, we solve four 
channel-selection problems with T = 1. For example, with 
$f_{\{s(1)=1, z_{s(1)}=1\}} = \frac{1}{3} \delta(0.1, 0) + \frac{2}{3} \delta(0.8, 1)$, if the cognitive 
user chooses channel 1, the expected payoff would be 

$$100 \times \left( \frac{1}{3} \times 0.1 + \frac{2}{3} \times 0.8 \right) = \frac{170}{3}.$$

If the cognitive user chooses channel 2, the expected payoff 
would be 

$$100 \times \left( \frac{1}{3} \times 0 + \frac{2}{3} \times 1 \right) = \frac{200}{3}.$$

Thus 

$$V^*(f_{\{s(1)=1, z_{s(1)}=1\}}, 1) = \max\{170/3, 200/3\} = 200/3,$$

and the user should choose channel 2. 
Similarly, we have 

$$V^*(f_{\{s(1)=1, z_{s(1)}=0\}}, 1) = 100 \times \max\{26/190, 1/19\} = 260/19,$$

and the user should choose channel 1; 

$$V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) = \max\{80, 100\} = 100,$$

and the user should choose channel 2; 

$$V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) = \max\{10, 0\} = 10,$$

and the user should choose channel 1. 

3) Finally, we solve the problem with pdf $f$ and T = 2. 
If the cognitive user chooses channel 1 in the first step, we 
calculate 

$$\begin{aligned}
E_f\{B Z_1 + V^*(f_{Z_1}, 1)\}
&= P(\theta_1 = 0.1) \left[ 100 \times 0.1 + 0.1 \times V^*(f_{\{s(1)=1, z_{s(1)}=1\}}, 1) + (1 - 0.1) \times V^*(f_{\{s(1)=1, z_{s(1)}=0\}}, 1) \right] \\
&\quad + P(\theta_1 = 0.8) \left[ 100 \times 0.8 + 0.8 \times V^*(f_{\{s(1)=1, z_{s(1)}=1\}}, 1) + (1 - 0.8) \times V^*(f_{\{s(1)=1, z_{s(1)}=0\}}, 1) \right] \\
&= 252/5.
\end{aligned}$$

Similarly, if the cognitive user chooses channel 2 in the first 
step, we calculate 

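$$\begin{aligned}
E_f\{B Z_2 + V^*(f_{Z_2}, 1)\}
&= P(\theta_2 = 0) \left[ 100 \times 0 + 0 \times V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) + (1 - 0) \times V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) \right] \\
&\quad + P(\theta_2 = 1) \left[ 100 + V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) + (1 - 1) \times V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) \right] \\
&= P(\theta_2 = 0) \, V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) + P(\theta_2 = 1) \left[ 100 + V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) \right] \\
&= \frac{4}{5} \times 10 + \frac{1}{5} \times 200 = \frac{240}{5}.
\end{aligned}$$

Thus 

$$V^*(f, 2) = \max_{s(1) \in \mathcal{N}} E_f\left\{ B Z_{s(1)} + V^*(f_{Z_{s(1)}}, 1) \right\} = \max\{252/5, 240/5\} = 252/5.$$

Hence, the optimal strategy is $\Gamma^*(f) = 1$, $\Gamma^*(f, z_1 = 1) = 2$, 
$\Gamma^*(f, z_1 = 0) = 1$. In other words, the cognitive user should 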
sense channel 1 in the first time slot. Interestingly, if channel 
1 is found free, the user should switch to channel 2 in the 
second time slot. On the other hand, if channel 1 is found 
busy, the cognitive user should keep sensing channel 1 at the 
second time slot. Finally, we observe that the optimal strategy 
offers a gain of 12/5 bits, on average, as compared with the 
myopic strategy. □ 
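
The recursion (5) and the numbers in Example 1 can be checked with a short dynamic program. This is a minimal sketch that assumes a prior made of finitely many point masses, as in Example 1; the helper names `update` and `solve` are hypothetical, and channels are indexed from 0 in the code (so index 0 is channel 1).

```python
from fractions import Fraction as F

def update(prior, s, z):
    """Posterior after sensing channel s and observing z, following (3) and (4)."""
    post = {th: p * (th[s] if z == 1 else 1 - th[s]) for th, p in prior.items()}
    total = sum(post.values())
    return {th: p / total for th, p in post.items() if p > 0}

def solve(prior, T, B):
    """Return (V*(f, T), first channel to sense) by applying (5) recursively."""
    if T == 0:
        return F(0), None
    n = len(next(iter(prior)))
    best_v, best_i = None, None
    for i in range(n):
        q = sum(p * th[i] for th, p in prior.items())  # P(Z_i = 1) under the prior
        v = B * q                                      # immediate expected payoff
        if q > 0:   # continuation value if channel i is found free
            v += q * solve(update(prior, i, 1), T - 1, B)[0]
        if q < 1:   # continuation value if channel i is found busy
            v += (1 - q) * solve(update(prior, i, 0), T - 1, B)[0]
        if best_v is None or v > best_v:
            best_v, best_i = v, i
    return best_v, best_i

# Prior of Example 1: mass 4/5 at (0.1, 0) and mass 1/5 at (0.8, 1), with B = 100.
prior = {(F(1, 10), F(0)): F(4, 5), (F(4, 5), F(1)): F(1, 5)}
value, first = solve(prior, T=2, B=100)
```

The cost of this recursion grows exponentially with T, since every slot branches over N channels and two outcomes, which is the complexity issue noted in the text.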

The optimal solution presented above can be simplified 
when $f(\boldsymbol{\theta})$ has a certain structure, as illustrated by the 
following examples. 

Example 2: (Symmetric Channels) We have N = 2 channels. 
Without loss of generality, let $0 < \theta_b < \theta_a < 1$. At any 
block, either 1) channel 1 has probability $\theta_a$ of being free and 
channel 2 has probability $\theta_b$ of being free or 2) channel 1 has 
probability $\theta_b$ of being free and channel 2 has probability $\theta_a$ 
of being free. The cognitive user does not know exactly which 
case happens. The prior pdf information is thus given by 

$$f(\theta_1, \theta_2) = \xi \, \delta(\theta_a, \theta_b) + (1 - \xi) \, \delta(\theta_b, \theta_a),$$

where $\xi$ is a parameter. The optimal strategy under this 
scenario is the following. 

1) At the first time slot, choose channel 1 if $\xi > 1/2$. 
If $\xi = 1/2$, randomly choose channel 1 or channel 2. 
Otherwise choose channel 2. 

2) At time slots $j \geq 2$, update the pdf based on $\Psi(j) = 
\{s_1, z_{s_1}, \cdots, s_{j-1}, z_{s_{j-1}}\}$ using (3) and (4). It is easy 
to see that $f^j$ has the following form 

$$f^j(\theta_1, \theta_2) = \xi_j \, \delta(\theta_a, \theta_b) + (1 - \xi_j) \, \delta(\theta_b, \theta_a).$$

Then, choose channel 1 if $\xi_j > 1/2$, randomly choose 
channel 1 or 2 if $\xi_j = 1/2$, and choose channel 2 
otherwise. 

The optimality of this myopic strategy was proved in [9]. 

The previous myopic strategy is also optimal for some other 
special scenarios. For example, if the prior pdf is $f(\boldsymbol{\theta}) = 
\xi \delta(a, b) + (1 - \xi) \delta(c, d)$, then any of the following conditions 
ensures the optimality of the myopic strategy [10]: 1) a + b = 
c + d = 1, 2) a < b and c < d, 3) a > b and c > d. □ 
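
The myopic rule of Example 2 only needs to track the scalar $\xi_j$. The sketch below applies (3) and (4) to the two-point symmetric prior; the function names `update_xi` and `myopic_choice` are hypothetical, channels are numbered 1 and 2 as in the text, and the numerical values are for illustration only.

```python
import random

def update_xi(xi, s, z, theta_a, theta_b):
    """Update xi = P(channel 1 has theta_a) after sensing channel s and seeing z."""
    # Free probability of the sensed channel under case 1 and under case 2.
    p1 = theta_a if s == 1 else theta_b
    p2 = theta_b if s == 1 else theta_a
    # Likelihood of the observation under each case.
    l1 = p1 if z == 1 else 1 - p1
    l2 = p2 if z == 1 else 1 - p2
    return xi * l1 / (xi * l1 + (1 - xi) * l2)

def myopic_choice(xi, rng=random):
    """Sense channel 1 if xi > 1/2, channel 2 if xi < 1/2, and randomize on a tie."""
    if xi > 0.5:
        return 1
    if xi < 0.5:
        return 2
    return rng.choice([1, 2])

# Hypothetical values: theta_a = 0.8, theta_b = 0.2, uninformative xi = 1/2.
xi_next = update_xi(0.5, s=1, z=1, theta_a=0.8, theta_b=0.2)
```

Finding the sensed channel free shifts belief toward the case in which that channel has the larger parameter $\theta_a$, so the myopic rule keeps sensing it.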



Example 3: (One Known Channel) We have N = 2 channels 
with independent traffic distributions. Moreover, $\theta_2$ is known. 
The traffic pattern of channel 1 is unknown, and the probability 
density function of $\theta_1$ is given by $f_1(\theta_1)$. 

Since channel 2 is known and is independent of channel 
1, sensing channel 2 will not provide the cognitive user with 
any new information. Hence, once the cognitive user starts 
accessing channel 2 (meaning that at a certain stage, sensing 
channel 2 is optimal), there would be no reason to return to 
channel 1 in the optimal strategy. A generalized version of this 
assertion was first proved in Lemma 4.1 of [11]. Restated in 
our channel selection setup, we have the following lemma. 

Lemma 2: In the optimal medium access strategy, once the 
cognitive user starts accessing channel 2, it should keep pick- 
ing the same channel in the remaining time slots, regardless 
of the outcome of the sensing process. □ 

This lemma essentially converts the channel selection prob- 
lem to an optimal stopping problem [12], [13], where we only 
need to focus on the strategies that decide at which time-slot 
we should stop sensing channel 1, if it is ever accessed. The 
following lemma derives the optimal stopping rule. 

Lemma 3: For any $f_1$ and any T, if $\theta_2 > \Lambda(f_1, T)$, 
then we should sense channel 2. Here 

$$\Lambda(f_1, T) = \max_{\Gamma : \, \Gamma(f_1) = 1} \frac{E_{f_1}\left\{ \sum_{j=1}^{M} Z_1(j) \right\}}{E_{f_1}\{M\}}, \tag{6}$$

where $\Gamma$ ranges over the set of strategies that start with channel 1 and 
never switch back to channel 1 after selecting channel 2; and 
M is a random number that represents the last time slot in 
which channel 1 is sensed, when the cognitive user follows a 
strategy in this set. 

Proof: This result follows as a direct application of 
Theorem 5.3.1 and Corollary 5.3.2 of [5]. ■ 
One can now combine Lemma 2 and Lemma 3 to obtain the 
following optimal strategy. 

1) At any time slot j, if channel 2 was sensed at time slot 
j — 1, keep sensing channel 2. 

2) If channel 1 was sensed at time slot j - 1, update the 
pdf $f_1^j$ using (3) and (4) and compute $\Lambda(f_1^j, T - j + 1)$ 
using (6). If $\Lambda(f_1^j, T - j + 1) < \theta_2$, switch to channel 
2; otherwise, keep sensing channel 1. □ 

Example 4: (Independent Channels) 
We have N independent channels with $f(\boldsymbol{\theta}) = \prod_{i=1}^{N} f_i(\theta_i)$. 
This case has a simple form of solution in the asymptotic 
scenario $T \to \infty$, assuming the following discounted form for 
the utility function 

$$W = E_f\left\{ \sum_{j=1}^{\infty} \alpha^j B Z_{S(j)}(j) \right\},$$

where $0 < \alpha < 1$ is a discount factor. As discussed in the 
introduction, this scenario has been considered in [7], and the 
optimal strategy for this scenario is the following. 

1) If channel $l$ was selected at time slot j - 1, then we get 
the updated pdf $f_l^j$ using equations (3) and (4), based 
on the sensing result $z_l(j - 1)$. For other channels, we 
let $f_i^j = f_i^{j-1}$, $\forall i \neq l$, $i \in \mathcal{N}$. That is, we only update 
the pdf of the channel which was just accessed (due to 
the independence assumption). 
2) For each channel, we calculate an index using the 
following equation 

$$\Lambda_i(f_i^j) = \max_{\Gamma} \frac{E_{f_i^j}\left\{ \sum_{m=1}^{M} \alpha^m Z_i(m) \right\}}{E_{f_i^j}\left\{ \sum_{m=1}^{M} \alpha^m \right\}},$$

where $\Gamma$ ranges over the set of strategies for the equivalent 
One-Known-Channel selection problem (with channel i 
having the unknown parameter) and M is a random 
number corresponding to the last time slot in which 
channel i will be selected in the equivalent One-Known-Channel 
case. $\Lambda_i$ is typically referred to as the Gittins 
Index [14]. 

3) Choose the channel with the largest Gittins index to 
sense at time slot j. 

The optimality of this strategy is a direct application of 
the elegant result of Gittins and Jones [14]. Computational 
methods for evaluating the Gittins Index $\Lambda_i$ can be found 
in [15] and references therein. 

B. Non-parametric Asymptotic Analysis and Asymptotically 
Optimal Strategies 

The optimal solution developed in Section III-A suffers 
from a prohibitive computational complexity. In particular, the 
dimensionality of our search space grows exponentially 
with the block length T. Moreover, one can envision many 
practical scenarios in which it would be difficult for the 
cognitive user to obtain the prior information $f(\boldsymbol{\theta})$. This 
motivates our pursuit of low complexity non-parametric protocols 
which maintain certain optimality properties. In this section, we 
analyze non-parametric schemes that do not explicitly use 
$f(\boldsymbol{\theta})$; thus the rules $\Gamma$ considered in this section depend only 
on $\Psi(j)$ explicitly. Towards this end, we study the asymptotic 
performance of several low complexity approaches as the 
block length T increases. This section concludes with 
our asymptotically optimal non-parametric protocols, which 
require only linear computational complexity. 

For a certain strategy $\Gamma$, the expected number of bits the 
cognitive user is able to transmit through a block with certain 
parameters $\boldsymbol{\theta}$ is 

$$E\left\{ \sum_{j=1}^{T} B Z_{S(j)}(j) \right\} = \sum_{i=1}^{N} \sum_{j=1}^{T} B \theta_i \Pr\{\Gamma(\Psi(j)) = i\}.$$

Recall that $\Gamma(\Psi(j)) = i$ means that, following strategy $\Gamma$, the 
cognitive user should choose channel i at time slot j, based 
on the available information $\Psi(j)$. Here $\Pr\{\Gamma(\Psi(j)) = i\}$ is 
the probability that the cognitive user will choose channel i at 
time slot j, following the strategy $\Gamma$. 

Compared with the idealistic case where the exact value of $\boldsymbol{\theta}$ 
is known, in which the optimal strategy for the cognitive user is 
to always choose the channel with the largest free probability, 
the loss entailed by $\Gamma$ is given by 

$$L(\boldsymbol{\theta}; \Gamma) = \sum_{j=1}^{T} B \theta_{i^*} - \sum_{i=1}^{N} \sum_{j=1}^{T} B \theta_i \Pr\{\Gamma(\Psi(j)) = i\},$$

where $\theta_{i^*} = \max\{\theta_1, \cdots, \theta_N\}$. We say that a strategy $\Gamma$ 
is consistent if, for any $\boldsymbol{\theta} \in [0, 1]^N$, there exists $\beta < 1$ 
such that $L(\boldsymbol{\theta}; \Gamma)$ scales as $O(T^{\beta})$ (see footnote 1). For example, consider 
a royal scheme in which the cognitive user selects channel 
i at the beginning of a block and sticks to it. If $\theta_i$ is the 
largest one among $\boldsymbol{\theta}$, $L(\boldsymbol{\theta}; \Gamma) = 0$. On the other hand, if $\theta_i$ 
is not the largest one, $L(\boldsymbol{\theta}; \Gamma) \sim O(T)$. Hence, this royal 
scheme is not consistent. The following lemma characterizes 
the fundamental limits of any consistent scheme. 

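Lemma 4: For any $\boldsymbol{\theta}$ and any consistent strategy $\Gamma$, we have 

$$\liminf_{T \to \infty} \frac{L(\boldsymbol{\theta}; \Gamma)}{\ln T} \geq B \sum_{i \in \mathcal{N} \setminus \{i^*\}} \frac{\theta_{i^*} - \theta_i}{D(\theta_i \| \theta_{i^*})}, \tag{7}$$

where $D(\theta_i \| \theta_l)$ is the Kullback-Leibler divergence between 
the two Bernoulli random variables with parameters $\theta_i$ and $\theta_l$ 
respectively: 

$$D(\theta_i \| \theta_l) = \theta_i \ln \frac{\theta_i}{\theta_l} + (1 - \theta_i) \ln \frac{1 - \theta_i}{1 - \theta_l}.$$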

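Proof: The proof is an application of a theorem proved 
in [16]. More specifically, for a general bandit problem, let 
$x$ be the random payoff obtained by choosing bandit i (not 
necessarily Bernoulli), and we also let $h_{\theta_i}(x)$ be the pdf of $x$ 
for a given $\theta_i$. 

Let $\mu_i$ denote the average payoff of bandit i, i.e., 

$$\mu_i = \int x \, h_{\theta_i}(x) \, dx,$$

and note that the Kullback-Leibler divergence between bandit 
i and l is given by 

$$D(\theta_i \| \theta_l) = \int \left[ \ln h_{\theta_i}(x) - \ln h_{\theta_l}(x) \right] h_{\theta_i}(x) \, dx.$$

Let $i^* = \arg\max_{i \in \mathcal{N}} \mu_i$, i.e., the index of the channel with 
the largest average payoff. Under mild regularity conditions 
on $h_{\theta_i}(x)$, it has been proved in Theorem 1 of [16] that for 
any consistent strategy $\Gamma$ 

$$\liminf_{T \to \infty} \frac{L(\boldsymbol{\theta}; \Gamma)}{\ln T} \geq \sum_{i \in \mathcal{N} \setminus \{i^*\}} \frac{\mu_{i^*} - \mu_i}{D(\theta_i \| \theta_{i^*})}. \tag{8}$$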



In our cognitive radio channel selection problem, given $\boldsymbol{\theta}$, $x$ 
is a random variable with 

$$h_{\theta_i}(x) = \theta_i \, \delta(x - B) + (1 - \theta_i) \, \delta(x);$$

hence $\mu_i = B \theta_i$, and 

$$D(\theta_i \| \theta_l) = \theta_i \ln\left( \frac{\theta_i}{\theta_l} \right) + (1 - \theta_i) \ln\left( \frac{1 - \theta_i}{1 - \theta_l} \right).$$

Substituting these parameters into (8), the proof is complete. 

■ 

Footnote 1: In this paper, we use Knuth's asymptotic notations: 
1) $g_1(N) = o(g_2(N))$ means $\forall c > 0, \exists N_0, \forall N > N_0, g_1(N) < c \, g_2(N)$; 
2) $g_1(N) = \omega(g_2(N))$ means $\forall c > 0, \exists N_0, \forall N > N_0, g_2(N) < c \, g_1(N)$; 
3) $g_1(N) = O(g_2(N))$ means $\exists c_2 > c_1 > 0, N_0, \forall N > N_0, c_1 g_2(N) < g_1(N) < c_2 g_2(N)$. 

Lemma 4 shows that the loss of any consistent strategy scales at least as $\omega(\ln T)$. An intuitive explanation of this loss is that we need to spend at least $O(\ln T)$ time slots on sampling each of the channels with smaller $\theta_i$, in order to get a reasonably accurate estimate of $\theta$, and hence, use it to determine the channel having the largest $\theta_i$ to sense. We say that a strategy $\Gamma$ is order optimal if $L(\theta;\Gamma) \sim O(\ln T)$.
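
The lower bound in (7) can be evaluated numerically for a given availability vector. The following is a minimal sketch, not part of the original analysis; the function names `bernoulli_kl` and `lower_bound_constant` are hypothetical, and the code assumes the largest availability probability is strictly less than 1 so that the divergence is finite.

```python
import math


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and Bernoulli(q).

    Assumes 0 < q < 1; terms with zero weight are skipped (0 * ln 0 = 0).
    """
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def lower_bound_constant(theta, bits_per_slot=1.0):
    """Coefficient of ln T on the right-hand side of (7).

    Channels tied with the best one contribute nothing to the sum.
    """
    best = max(theta)
    return bits_per_slot * sum(
        (best - t) / bernoulli_kl(t, best) for t in theta if t < best
    )
```

For example, `lower_bound_constant([0.8, 0.2])` evaluates the single term $(0.8-0.2)/D(0.2\|0.8)$, which shows how a small divergence between a poor channel and the best one inflates the unavoidable loss.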

Now, the first question that arises is whether there exist order optimal strategies. As shown later in this section, we can design suboptimal strategies that have loss of order $O(\ln T)$. Thus the answer to this question is affirmative. Before proceeding to the proposed low complexity order-optimal strategy, we first analyze the loss order of some heuristic strategies which may appear appealing in certain applications.

The first simple rule is the random strategy $\Gamma_r$ where, at each time slot, the cognitive user randomly chooses a channel from the available $N$ channels. The fraction of time slots the cognitive user spends on each channel is therefore $1/N$, leading to the loss

$$L(\theta;\Gamma_r) = B\sum_{i=1}^{N}\frac{\theta_{i^*}-\theta_i}{N}\,T \sim \Theta(T).$$



The second one is the myopic rule $\Gamma_g$ in which the cognitive user keeps updating $f^j(\theta)$, and chooses the channel with the largest value of

$$\hat\theta_i = \int_0^1 \theta_i f^j(\theta)\,d\theta$$

at each stage. Since there are no convergence guarantees for the myopic rule, that is, $\hat\theta$ may never converge to $\theta$ due to the lack of sufficiently many samples for each channel [17], the loss of this myopic strategy is $\Theta(T)$.

The third protocol we consider is the staying with the winner and switching from the loser rule $\Gamma_{sw}$, where the cognitive user randomly chooses a channel in the first time slot. In the succeeding time slots 1) if the accessed channel was found to be free, it will choose the same channel to sense; 2) otherwise, it will choose one of the remaining channels based on a certain switching rule.

Lemma 5: No matter what the switching rule is, $L(\theta;\Gamma_{sw}) \sim \Theta(T)$.

Proof: Let $i^* = \arg\max_{i\in\mathcal{N}}\theta_i$ and $i^{**} = \arg\max_{i\in\mathcal{N}\setminus\{i^*\}}\theta_i$, i.e., $i^*$ is the best channel, and $i^{**}$ is the second best channel. To avoid trivial conditions, without loss of generality we assume that $\theta_{i^*} \ne \theta_{i^{**}}$ and $\theta_{i^*} \ne 1$. We can upper bound the performance of the staying with the winner and switching from the loser rule by assuming that the cognitive user has the following extra knowledge.

1) In the first time slot, the cognitive user is able to choose 
i* correctly. 

2) Once i* is sensed busy, the cognitive user somehow 
knows which channel is the second best, and switches 
to i**. 

3) Once i** is sensed busy, the cognitive user is always 
able to switch back to i*. 



We denote this optimistic rule by $\Gamma^*_{sw}$. With any realistic switching rule $\Gamma_{sw}$, we have

$$L(\theta;\Gamma_{sw}) \ge L(\theta;\Gamma^*_{sw}).$$

Now with the optimistic rule $\Gamma^*_{sw}$, the system can be modelled as the Markov process shown in Figure 2, in which we have two states: 1) sensing channel $i^*$ and 2) sensing channel $i^{**}$. The transition probability matrix is

$$P = \begin{bmatrix}\theta_{i^*} & 1-\theta_{i^*}\\ 1-\theta_{i^{**}} & \theta_{i^{**}}\end{bmatrix}.$$

Fig. 2. A Markov process representation of the optimistic strategy $\Gamma^*_{sw}$.

The probability $P_{i^{**}}$ that the cognitive user will sense channel $i^{**}$ can be obtained by solving the following stationary equation

$$[P_{i^*},\; P_{i^{**}}]\,P = [P_{i^*},\; P_{i^{**}}],$$

from which we obtain

$$P_{i^{**}} = \frac{1-\theta_{i^*}}{1-\theta_{i^*}+1-\theta_{i^{**}}}.$$

Hence in the nontrivial cases, we have

$$L(\theta;\Gamma^*_{sw}) = B P_{i^{**}}(\theta_{i^*}-\theta_{i^{**}})\,T,$$

implying that, for any switching rule, $L(\theta;\Gamma_{sw}) \sim \Theta(T)$. ■
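
The two-state chain in the proof of Lemma 5 is small enough to check numerically. The sketch below is illustrative only; the function names `second_best_fraction` and `optimistic_switch_loss` are hypothetical. It computes the stationary probability of sensing the second-best channel and the resulting linear-in-$T$ loss of the optimistic rule.

```python
def second_best_fraction(theta_best, theta_second):
    """Stationary probability of sensing the second-best channel.

    State 0: sensing the best channel (stay with probability theta_best).
    State 1: sensing the second-best channel (stay with probability theta_second).
    Assumes theta_best < 1 so the chain actually leaves state 0.
    """
    leave_best = 1.0 - theta_best
    leave_second = 1.0 - theta_second
    return leave_best / (leave_best + leave_second)


def optimistic_switch_loss(theta_best, theta_second, bits_per_slot, block_len):
    """Loss of the optimistic stay-with-winner rule over one block."""
    frac = second_best_fraction(theta_best, theta_second)
    return bits_per_slot * frac * (theta_best - theta_second) * block_len
```

Because the stationary fraction does not depend on the block length, doubling the block length doubles the loss, which is the linear scaling stated in the lemma.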
There are several strategies that have loss of order $O(\ln T)$. We adopt the following linear complexity strategy which was proposed and analyzed in [18].

Rule 1: (Order optimal single index strategy)
The cognitive user maintains two vectors $X$ and $Y$, where each $X_i$ records the number of time slots for which the cognitive user has sensed channel $i$ to be free, and each $Y_i$ records the number of time slots for which the cognitive user has chosen channel $i$ to sense. The strategy works as follows.

1) Initialization: at the beginning of each block, sense each channel once.

2) After the initialization period, the cognitive user obtains an estimate $\hat\theta$ at the beginning of time slot $j$, given by

$$\hat\theta_i(j) = \frac{X_i(j)}{Y_i(j)},$$

and assigns an index

$$\Lambda_i(j) = \hat\theta_i(j) + \sqrt{\frac{2\ln j}{Y_i(j)}}$$

to the $i$th channel. The cognitive user chooses the channel with the largest value of $\Lambda_i(j)$ to sense at time slot $j$. After each sensing, the cognitive user updates $X$ and $Y$.

The intuition behind this strategy is that as long as $Y_i$ grows as fast as $O(\ln T)$, $\Lambda_i$ converges to the true value of $\theta_i$ in probability, and the cognitive user will choose the channel with the largest $\theta_i$ eventually. The loss of $O(\ln T)$ comes from the time spent on sampling the inferior channels in order to learn the value of $\theta$. This price, however, is inevitable as established in the lower bound of Lemma 4. □

Finally, we observe that the difference between the myopic rule and the order optimal single index rule is the additional term $\sqrt{2\ln j/Y_i(j)}$ added to the current estimate $\hat\theta_i$. Roughly speaking, this additional term guarantees enough sampling time for each channel, since if we sample channel $i$ too sparsely, $Y_i(j)$ will be small, which will increase the probability that $\Lambda_i$ is the largest index. When $Y_i(j)$ scales as $\ln T$, $\hat\theta_i$ will be the dominant term in the index $\Lambda_i$, and hence the channel with the largest $\hat\theta_i$ will be chosen much more frequently.
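
Rule 1 can be sketched in a few lines of simulation code. This is an illustration under stated assumptions rather than the authors' implementation: channel availability is drawn as independent Bernoulli trials with the (hypothetical) true probabilities `theta`, sensing is assumed error free, and the function name `single_index_strategy` is made up for this example.

```python
import math
import random


def single_index_strategy(theta, block_len, rng):
    """Simulate Rule 1 for one block and return (X, Y, choices).

    theta     : true availability probabilities (unknown to the user)
    block_len : number of time slots T in the block
    rng       : a random.Random instance used for the channel draws
    """
    n = len(theta)
    free = [0] * n      # X_i: slots in which channel i was sensed free
    sensed = [0] * n    # Y_i: slots in which channel i was sensed
    choices = []
    for j in range(1, block_len + 1):
        if j <= n:
            # Initialization: sense each channel once.
            i = j - 1
        else:
            # Index = empirical mean + exploration bonus sqrt(2 ln j / Y_i).
            i = max(
                range(n),
                key=lambda c: free[c] / sensed[c]
                + math.sqrt(2.0 * math.log(j) / sensed[c]),
            )
        sensed[i] += 1
        if rng.random() < theta[i]:
            free[i] += 1
        choices.append(i)
    return free, sensed, choices
```

Running `single_index_strategy([0.9, 0.1], 2000, random.Random(0))` should show the inferior channel being sensed only a small number of times relative to the block length, in line with the logarithmic sampling cost discussed above.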

IV. Multi User-Single Channel 

The presence of multiple cognitive users adds an element of competition to the problem. In order for a cognitive user to get hold of a channel now, it must be free from the primary traffic and the other competing cognitive users. More rigorously, we assume the presence of a set $\mathcal{K} = \{1,\cdots,K\}$ of cognitive users and consider the distributed medium access decision processes at the multiple users with no prior coordination. We denote $\mathcal{K}_i(j) \subseteq \mathcal{K}$ as the random set of users who choose to sense channel $i$ at time slot $j$. We assume that the users follow a generalized version of the Carrier Sense Multiple Access/Collision Avoidance (CSMA-CA) protocol to access the channel after sensing the main channel to be free, i.e., if channel $i$ is free, each user $k$ in the set $\mathcal{K}_i(j)$ will generate a random number $t_k(j)$ according to a certain probability density function $g$, and wait the time specified by the generated random number. At the end of the waiting period, user $k$ senses the channel again, and if it is found free, the packet from user $k$ will be transmitted. The probability that user $k$ in the set $\mathcal{K}_i(j)$ gains access to the channel is the same as the probability that $t_k(j)$ is the smallest random number generated by the users in the set $\mathcal{K}_i(j)$. Thus, the throughput user $k$ achieves in a block is

$$W_k = \sum_{j=1}^{T} B Z_{S_k(j)}(j)\, I\Big\{k = \arg\min_{q\in\mathcal{K}_{S_k(j)}(j)} t_q(j)\Big\}.$$

Therefore, user $k$ should devise a sensing rule $\Gamma_k$ that maximizes

$$E\{W_k\} = E\Big\{\sum_{j=1}^{T} B Z_{S_k(j)}(j)\, I\Big\{k = \arg\min_{q\in\mathcal{K}_{S_k(j)}(j)} t_q(j)\Big\}\Big\}.$$

Clearly, with multiple cognitive users, it is no longer optimal for all the users to always choose the channel with the largest $\theta_i$ to sense. In particular, if all the users choose the channel with the largest $\theta_i$, the probability that a given user gains control of the channel decreases, while potential opportunities in the other channels in the primary network are wasted.

A. Known $\theta$ Case

To enable a succinct presentation, we first consider the case in which the values of $\theta$ are known to all the cognitive users. The users distributively choose channels to sense and compete for access if the channels are free.

1) The Optimal Symmetric Strategy: Without loss of generality, we consider a mixed strategy where user $k$ will choose channel $i$ with probability $p_{k,i}$. Furthermore, we let $p_k = [p_{k,1},\cdots,p_{k,N}]$ and consider the symmetric solution in which $p = p_1 = \cdots = p_K$. The symmetry assumption implies that all the users in the network distributively follow the same rule to access the spectral opportunities present in the primary network, in order to maximize the same average throughput each user can obtain. The following result derives the optimal solution in this situation.

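Lemma 6: For a cognitive network with $K > 1$ cognitive users and $N$ channels with probability $\theta$ of being free, the optimal $p^*$ is given by

$$p_i^* = \begin{cases}\Big\{1-\big(\frac{\lambda^*}{K\theta_i}\big)^{1/(K-1)}\Big\}^+, & \text{for } \theta_i > 0,\\ 0, & \text{for } \theta_i = 0,\end{cases}$$

where $\lambda^*$ is a constant such that $\sum_{i=1}^{N} p_i^* = 1$. Here $\{x\}^+ = \max\{0, x\}$.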

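Proof: With a strategy $p$, the probability that user $k$ chooses channel $i$ and, at the same time, there are $l$ other users choosing channel $i$ to sense is

$$\binom{K-1}{l}\, p_i^{\,l+1}(1-p_i)^{K-1-l}.$$

Under this scenario, the average number of bits transmitted in one slot by each user is $B\theta_i/(l+1)$. Hence, the average throughput $W_k$ of user $k$ is

$$W_k = T\sum_{i=1}^{N}\sum_{l=0}^{K-1}\frac{B\theta_i}{l+1}\binom{K-1}{l}\, p_i^{\,l+1}(1-p_i)^{K-1-l}.$$

Based on our symmetry assumption, we drop the subscript $k$ and write the average throughput of each user as $W$, leading to

$$\begin{aligned}
W &= BT\sum_{i=1}^{N} p_i\theta_i \sum_{l=0}^{K-1}\binom{K-1}{l}\frac{p_i^{\,l}(1-p_i)^{K-1-l}}{l+1}\\
&= BT\sum_{i=1}^{N}\theta_i\sum_{l=0}^{K-1}\frac{(K-1)!}{l!\,(K-1-l)!}\,\frac{p_i^{\,l+1}(1-p_i)^{K-1-l}}{l+1}\\
&= \frac{BT}{K}\sum_{i=1}^{N}\theta_i\sum_{l=0}^{K-1}\binom{K}{l+1}\, p_i^{\,l+1}(1-p_i)^{K-1-l}\\
&= \frac{BT}{K}\sum_{i=1}^{N}\theta_i\Big\{\sum_{l'=0}^{K}\binom{K}{l'}\, p_i^{\,l'}(1-p_i)^{K-l'} - (1-p_i)^K\Big\}\\
&= BT\sum_{i=1}^{N}\frac{\theta_i}{K}\big\{1-(1-p_i)^K\big\}.
\end{aligned}$$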
Now, we should solve the following optimization problem

$$\begin{aligned}
\max\;& W = BT\sum_{i=1}^{N}\frac{\theta_i}{K}\big\{1-(1-p_i)^K\big\},\\
\text{s.t. }& \sum_{i=1}^{N} p_i = 1,\\
& p \ge 0.
\end{aligned}$$

This optimization problem is equivalent to the following:

$$\begin{aligned}
\min\;& y = \sum_{i=1}^{N}\theta_i(1-p_i)^K,\\
\text{s.t. }& \sum_{i=1}^{N} p_i = 1,\\
& p \ge 0.
\end{aligned} \qquad (9)$$



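Since

$$\frac{\partial^2 y}{\partial p_i^2} = \theta_i K(K-1)(1-p_i)^{K-2} \ge 0$$

for $0 \le p_i \le 1$, $y$ is a convex function of $p$ in the region of interest, i.e. $p \in [0,1]^N$. Also, the constraints are the intersection of a convex set and a linear constraint. Therefore, our problem reduces to a convex optimization problem whose Karush-Kuhn-Tucker (KKT) conditions [19] for optimality are

$$\begin{aligned}
p^* &\ge 0,\\
\sum_{i=1}^{N} p_i^* &= 1,\\
p_i^*\big(\lambda^* - K\theta_i(1-p_i^*)^{K-1}\big) &= 0,\\
\lambda^* &\ge K\theta_i(1-p_i^*)^{K-1},
\end{aligned}$$

where $\lambda^*$ is the Lagrange multiplier.
It is easy to check that if $K > 1$,

$$p_i^* = \begin{cases}\Big\{1-\big(\frac{\lambda^*}{K\theta_i}\big)^{1/(K-1)}\Big\}^+, & \text{for } \theta_i > 0,\\ 0, & \text{for } \theta_i = 0,\end{cases} \qquad (10)$$

satisfies the KKT conditions, in which $\lambda^*$ is the constant that satisfies $\sum_{i=1}^{N} p_i^* = 1$. ■
If $K = 1$, then $p^*_{i^*} = 1$, where $i^* = \arg\max_{i\in\mathcal{N}}\theta_i$, and $p^*_l = 0$, $l \in \mathcal{N}\setminus\{i^*\}$, satisfies the KKT conditions.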
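
Since the sum of the probabilities in (10) decreases monotonically in $\lambda^*$, the multiplier can be found by a one-dimensional search. The following sketch uses plain bisection, which is an implementation choice made here and not something specified in the text; the function names `optimal_symmetric_probs` and `per_user_throughput` are hypothetical.

```python
def optimal_symmetric_probs(theta, num_users, iters=200):
    """Solve (10) for p* by bisection on the multiplier lambda.

    Assumes num_users >= 2 and that at least one theta_i is positive.
    """
    if num_users < 2:
        raise ValueError("the closed form (10) requires K > 1")
    k = num_users

    def probs(lam):
        return [
            max(0.0, 1.0 - (lam / (k * t)) ** (1.0 / (k - 1))) if t > 0 else 0.0
            for t in theta
        ]

    # sum(probs(0)) >= 1 and sum(probs(K * max(theta))) == 0, so a root is bracketed.
    lo, hi = 0.0, k * max(theta)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return probs(0.5 * (lo + hi))


def per_user_throughput(theta, p, num_users, bits_per_slot=1.0, block_len=1):
    """Average throughput W = (B T / K) * sum_i theta_i * (1 - (1 - p_i)^K)."""
    return (
        bits_per_slot
        * block_len
        / num_users
        * sum(t * (1.0 - (1.0 - q) ** num_users) for t, q in zip(theta, p))
    )
```

With two users and two channels the result can be compared against the closed form $p_i^* = \theta_i/(\theta_1+\theta_2)$ derived later in the special cases.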

So, the total throughput of the $K$ cognitive users is

$$KW = BKT\sum_{i=1}^{N}\frac{\theta_i}{K}\big\{1-(1-p_i^*)^K\big\} = BT\sum_{i=1}^{N}\theta_i\big\{1-(1-p_i^*)^K\big\}.$$

On the other hand, the average total spectral opportunities of the primary network is $BT\sum_{i=1}^{N}\theta_i$. This upper bound can be achieved by a centralized channel allocation strategy when $K \ge N$ (simply by assigning one cognitive user to each channel). Therefore, the loss of the distributed protocol as compared with the centralized scheduling is

$$L = BT\sum_{i=1}^{N}\theta_i(1-p_i^*)^K,$$

which is the same as (9) up to a constant factor. There is an intuitive explanation of this loss. If there is a spectral opportunity in channel $i$ but there are no users choosing channel $i$ to sense, a loss occurs. The probability that there is no user choosing channel $i$ to sense is $(1-p_i^*)^K$, and hence the probability of loss occurring at channel $i$ is $\theta_i(1-p_i^*)^K$. To obtain further insights on the performance of the cognitive network, we study the following special cases.

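1) $N > 1$, $K = 1$. As stated above, $p^*_{i^*} = 1$, and $p^*_l = 0$, $l \in \mathcal{N}\setminus\{i^*\}$. Hence, the user should choose the channel with the largest free probability to sense, and

$$L = BT\sum_{i\in\mathcal{N}\setminus\{i^*\}}\theta_i.$$

2) $N = 2$, $K = 2$. Substituting $N = 2$ and $K = 2$ into (10), we obtain

$$p_1^* = \theta_1/(\theta_1+\theta_2) \quad\text{and}\quad p_2^* = \theta_2/(\theta_1+\theta_2).$$

Furthermore,

$$\begin{aligned}
W &= \frac{BT\theta_1}{2}\Big(1-\frac{\theta_2^2}{(\theta_1+\theta_2)^2}\Big) + \frac{BT\theta_2}{2}\Big(1-\frac{\theta_1^2}{(\theta_1+\theta_2)^2}\Big)\\
&= \frac{BT(\theta_1+\theta_2)}{2} - \frac{BT\theta_1\theta_2}{2(\theta_1+\theta_2)}.
\end{aligned}$$

3) $N$ is fixed, and $K \to \infty$. We have the following asymptotic characterization.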

Lemma 7: Let $2 \le Q \le N$ be the number of channels for which $\theta_i > 0$. We have $p_i^* \to 1/Q$, and $L \to 0$ exponentially as $K$ increases, i.e.,

$$L \sim O(e^{-c_1 K}),$$

where $c_1 = \ln\frac{Q}{Q-1}$.

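Proof: Without loss of generality, we assume that $\theta_i \ne 0$, for $1 \le i \le Q$. At the moment, we assume that (we will show that this is true, if $K$ is large enough) if $\theta_i \ne 0$,

$$\Big\{1-\Big(\frac{\lambda^*}{K\theta_i}\Big)^{1/(K-1)}\Big\}^+ = 1-\Big(\frac{\lambda^*}{K\theta_i}\Big)^{1/(K-1)}.$$

Together with $\sum_{i=1}^{N} p_i^* = \sum_{i=1}^{Q} p_i^* = 1$, we have

$$(\lambda^*)^{1/(K-1)} = \frac{K^{1/(K-1)}(Q-1)}{\sum_{i=1}^{Q}\theta_i^{-1/(K-1)}}$$

and

$$p_i^* = 1 - \frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{l=1}^{Q}\theta_l^{-1/(K-1)}}, \quad\text{for } 1 \le i \le Q.$$

To satisfy the condition $p \ge 0$, we need to show

$$\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{l=1}^{Q}\theta_l^{-1/(K-1)}} \le 1$$

for all $i$ with $\theta_i > 0$.

With $i^* = \arg\max_{i\in\mathcal{N}}\theta_i$ and $l^* = \arg\min_{1\le i\le Q}\theta_i$, we have for all $i$

$$\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{l=1}^{Q}\theta_l^{-1/(K-1)}} \le \frac{(Q-1)\,\theta_{l^*}^{-1/(K-1)}}{Q\,\theta_{i^*}^{-1/(K-1)}} = \frac{Q-1}{Q}\Big(\frac{\theta_{i^*}}{\theta_{l^*}}\Big)^{1/(K-1)}.$$

For any $1 < \epsilon < Q/(Q-1)$, if $K$ is large enough, we have

$$\Big(\frac{\theta_{i^*}}{\theta_{l^*}}\Big)^{1/(K-1)} \le \epsilon,$$

since

$$\lim_{K\to\infty}\Big(\frac{\theta_{i^*}}{\theta_{l^*}}\Big)^{1/(K-1)} = 1.$$

Hence, for all $1 \le i \le Q$, we have

$$\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{l=1}^{Q}\theta_l^{-1/(K-1)}} \le \frac{Q-1}{Q}\,\epsilon < 1.$$

Now, straightforward limit calculation shows that

$$p_i^* \to \frac{1}{Q}$$

as $K$ increases. And

$$\lim_{K\to\infty}\frac{L}{e^{-c_1 K}} = \lim_{K\to\infty}\frac{BT\sum_{i=1}^{Q}\theta_i(1-p_i^*)^K}{\exp(-c_1 K)} = BTQ\Big(\prod_{i=1}^{Q}\theta_i\Big)^{1/Q},$$

with $c_1 = \ln\frac{Q}{Q-1}$. ■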
The reason for the exponential decrease in the loss is that, as the number of cognitive users increases, the probability that there is no user sensing any particular channel decreases exponentially. If $Q = 1$, there is no loss of performance, since all the users will always sense the channel with non-zero availability probability.

2) The Game Theoretic Model: The optimality of the distributed protocol proposed in the previous section hinges on the assumption that all the users will follow the symmetric rule. However, it is straightforward to see that if a single cognitive user deviates from the rule specified in Lemma 6, it will be able to transmit more bits. If this selfish perspective propagates through the network, it may lead to a significant reduction in the overall throughput. This observation motivates our next step in which the channel selection problem is modeled as a non-cooperative game, where the cognitive users are the players, the $\Gamma_k$s are the strategies and the average throughput of each user is the payoff. The following result derives a sufficient condition for the Nash equilibrium [20] in the asymptotic scenario $K \to \infty$.



Lemma 8: $(\Gamma_1,\cdots,\Gamma_K)$ is a Nash equilibrium, if $K$ is large and at each time slot, there are $\tau_i K$ users sensing channel $i$, where $\tau_i$ satisfies

$$\tau_i = \frac{\theta_i}{\sum_{l=1}^{N}\theta_l}. \qquad (11)$$

At this equilibrium, each user has probability $\sum_{i=1}^{N}\theta_i / K$ of transmitting at each time slot.

Proof: We prove this by backward induction. At the last time slot $T$, if the $\tau_i$s satisfy equation (11), the probability of user $k$ gaining a channel is

$$p_k = \frac{\sum_{i=1}^{N}\theta_i}{K}.$$

Now, if user $k$ deviates from this strategy, and chooses channel $i'$, the number of users sensing channel $i'$ is $\tau_{i'}K + 1$, and the probability of user $k$ gaining the channel is

$$p'_k = \frac{\theta_{i'}}{\tau_{i'}K+1} < \frac{\theta_{i'}}{\tau_{i'}K} = p_k.$$

Hence the strategy that has $\tau_i K$ users sensing channel $i$ at time slot $T$ is a Nash equilibrium. Now, we know the optimal strategy for the last time slot, so we can ignore this time slot. Then time slot $T-1$ becomes the last slot, in which this strategy is optimal. Similarly, we show that this strategy is optimal for all other time slots. ■

We note that in the lemma we implicitly assume that $\tau_i K$ is an integer. In practice, this is not always true. However, since $K$ is large, rounding $\tau_i K$ to the nearest integer will have minor effects. The Nash equilibrium is also optimal from a system perspective, in the sense that this strategy maximizes the total throughput of the whole network by fully utilizing the available spectral opportunities when $K$ is large (i.e., on the average, each user will be able to transmit $BT\sum_{i=1}^{N}\theta_i/K$ bits per block, and the total throughput of the network is $BT\sum_{i=1}^{N}\theta_i$).

With this equilibrium result, the cognitive users can use the following stochastic sensing strategy to approximately work on the equilibrium point for a large but finite $K$. Let $S_k(j)$ be the channel chosen by user $k$ at time slot $j$. At each time slot, each user independently selects channel $i$ with probability $\tau_i = \theta_i/\sum_{l=1}^{N}\theta_l$, i.e., $\Pr\{S_k(j) = i\} = \tau_i$. Then at each time slot, the number of users sensing channel $i$ will be $\sum_{k=1}^{K} I\{S_k(j) = i\}$, where the $I\{S_k(j) = i\}$s are i.i.d. Bernoulli random variables. Hence, the total number of users sensing channel $i$ is a binomial random number, and the fraction of users sensing channel $i$ converges to $\tau_i$ in probability as $K$ increases, i.e.

$$\frac{\sum_{k=1}^{K} I\{S_k(j) = i\}}{K} \to \tau_i$$

in probability. Hence, as $K$ increases, the operating point will converge to the Nash equilibrium in probability.
For any $K$, the probability that there is no user choosing channel $i$ to sense is $(1-\tau_i)^K$. Hence the performance loss compared with the centralized scheme is

$$L = BT\sum_{i=1}^{N}\theta_i(1-\tau_i)^K = BT\sum_{i=1}^{N}\theta_i\Big(1-\frac{\theta_i}{\sum_{l=1}^{N}\theta_l}\Big)^K.$$

It is easy to check that

$$\lim_{K\to\infty}\frac{L}{e^{-c_2 K}} = BT\theta_{i^*},$$

where $\theta_{i^*} = \min\{\theta_i : \theta_i > 0\}$, and $c_2 = \ln\frac{\sum_{i}\theta_i}{\sum_{i}\theta_i - \theta_{i^*}}$. It is now clear that the loss of the game theoretic scheme goes to zero exponentially, though the decay rate is smaller than that of the scheme specified in Lemma 6. On the other hand, compared with the scheme in Lemma 6, the game theoretic scheme has the advantage that the individual cognitive users do not need to know the total number of cognitive users $K$ in the network and, more importantly, they have no incentive to deviate unilaterally.
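
The equilibrium sensing probabilities in (11) and the associated loss are simple to compute. The sketch below is illustrative; the names `equilibrium_probs`, `game_loss`, and `decay_rate` are hypothetical, and the code assumes at least two channels have positive availability so that the decay rate is finite.

```python
import math


def equilibrium_probs(theta):
    """Sensing probabilities tau_i = theta_i / sum_l theta_l from (11)."""
    total = sum(theta)
    return [t / total for t in theta]


def game_loss(theta, num_users, bits_per_slot=1.0, block_len=1):
    """Loss B T sum_i theta_i (1 - tau_i)^K relative to centralized scheduling."""
    tau = equilibrium_probs(theta)
    return bits_per_slot * block_len * sum(
        t * (1.0 - q) ** num_users for t, q in zip(theta, tau)
    )


def decay_rate(theta):
    """Exponent c_2 = ln(sum theta / (sum theta - smallest positive theta))."""
    total = sum(theta)
    smallest = min(t for t in theta if t > 0)
    return math.log(total / (total - smallest))
```

For `theta = [0.6, 0.3]` the ratio `game_loss(theta, K) / math.exp(-decay_rate(theta) * K)` approaches `0.3` as `K` grows, matching the limit stated above with $B = T = 1$.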

B. Unknown $\theta$ Case

If $\theta$ is unknown, the cognitive users need to estimate $\theta$ (in addition to resolving their competition). Combining the results from Sections III-B and IV-A, we design the following low complexity strategy which is asymptotically optimal.

Rule 2: 1) Initialization: Each user $k$ maintains the following two vectors: $X_k$, which records the number of time slots in which user $k$ has sensed each channel to be free; and $Y_k$, which records the number of time slots in which user $k$ has sensed each channel. At the beginning of each block, user $k$ senses each channel once and transmits through this channel if the channel is free and it wins the competition. Also, set $X_{k,i} = 1$, regardless of the sensing result of this stage.

2) At the beginning of time slot $j$, user $k$ estimates $\theta_i$ as

$$\hat\theta_i(j) = X_{k,i}(j)/Y_{k,i}(j),$$

and chooses each channel $i \in \mathcal{N}$ with probability

$$\frac{\hat\theta_i(j)}{\sum_{l=1}^{N}\hat\theta_l(j)}. \qquad (12)$$

After each sensing, $X_k$ and $Y_k$ are updated. □
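
The sensing part of Rule 2 for a single user can be sketched as follows. This is a simplified illustration: it tracks only the sensing outcomes of one user and ignores the contention among users, channel availability is drawn as independent Bernoulli trials with hypothetical true probabilities `theta`, and the function name `rule2_sensing` is made up for this example.

```python
import random


def rule2_sensing(theta, block_len, rng):
    """Simulate the channel choices of one user under Rule 2.

    Returns (X_k, Y_k) at the end of the block. Assumes block_len >= len(theta).
    """
    n = len(theta)
    # Initialization: each channel sensed once, and X_{k,i} set to 1
    # regardless of the actual sensing result.
    free = [1] * n
    sensed = [1] * n
    for _ in range(n, block_len):
        estimate = [x / y for x, y in zip(free, sensed)]
        # Choose channel i with probability estimate_i / sum(estimate), as in (12).
        i = rng.choices(range(n), weights=estimate)[0]
        sensed[i] += 1
        if rng.random() < theta[i]:
            free[i] += 1
    return free, sensed
```

Setting $X_{k,i} = 1$ at initialization keeps every estimate strictly positive, so every channel retains a nonzero probability of being sensed in later slots.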
Lemma 9: If K is large, the scheme in Rule 2 converges 
to the Nash equilibrium specified in Lemma 8 in probability, 
as T increases. 

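Proof: $X_{k,i}$ is the sum of $Y_{k,i}$ i.i.d. Bernoulli random variables with parameter $\theta_i$. We use the following form of the Chernoff bound. Let $X$ be the sum of $n$ independent Bernoulli random variables with parameter $\theta$, then

$$\Pr\{X < (1-\delta)n\theta\} \le \exp(-n\theta\delta^2/2),$$

for any $\delta < 1$.

At time slot $j$, if we replace $X$ with $X_{k,i}(j)$, $n$ with $Y_{k,i}(j)$, $\theta$ with $\theta_i$ and let $\delta = 1/2$, then we have

$$\Pr\Big\{X_{k,i}(j) < \tfrac{1}{2}\theta_i Y_{k,i}(j)\Big\} \le \exp\big(-Y_{k,i}(j)\theta_i/8\big).$$

Therefore

$$\Pr\Big\{X_{k,i}(j) \ge \tfrac{1}{2}\theta_i Y_{k,i}(j)\Big\} \ge 1 - \exp(-\theta_i/8), \qquad (13)$$

since after the initialization period, $Y_{k,i}(j) \ge 1$.

Note that $Y_{k,i}(T)$ is the total number of time slots that user $k$ has sensed channel $i$ in each block with $T$ time slots. We have

$$\begin{aligned}
E\{Y_{k,i}(T)\} &= E\Big\{\sum_{j=1}^{T} I\{S_k(j) = i\}\Big\} = \sum_{j=1}^{T} E\{I\{S_k(j) = i\}\}\\
&= \sum_{j=1}^{T} E\Big\{\frac{X_{k,i}(j)/Y_{k,i}(j)}{\sum_{l=1}^{N} X_{k,l}(j)/Y_{k,l}(j)}\Big\}\\
&\overset{(a)}{\ge} \sum_{j=1}^{T} E\Big\{\frac{X_{k,i}(j)/Y_{k,i}(j)}{N}\Big\}\\
&\overset{(b)}{\ge} \sum_{j=1}^{T}\frac{\theta_i\big(1-\exp(-\theta_i/8)\big)}{2N} = \frac{T\theta_i\big(1-\exp(-\theta_i/8)\big)}{2N},
\end{aligned}$$

where (a) follows from the fact that $X_{k,i}(j)/Y_{k,i}(j) \le 1$, and (b) follows from (13).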

The probability that $Y_{k,i}(T) < (1-\delta)E\{Y_{k,i}(T)\}$ can also be bounded using the Chernoff bounds since $Y_{k,i}(T)$ is also the sum of independent Bernoulli random variables. In particular, we have

$$\Pr\big\{Y_{k,i}(T) < (1-\delta)E\{Y_{k,i}(T)\}\big\} \le \exp\big(-E\{Y_{k,i}(T)\}\delta^2/2\big).$$
On letting 



8 = 



/ lnE{y M (T)} 
E{r fe ,,(T)} ■ 



we have 



Pr{r fe ,,(T) < E{Yfc ; j(T)} - lnE{F fcjl (T)}} < 



Using the union bound, and the weak law of large numbers, $X_{k,i}(j)/Y_{k,i}(j)$ converges to $\theta_i$ in probability as $T$ increases (with probability larger than $1 - 1/T$). The scheme becomes the same as the known $\theta$ case, in which we know that the operating point is approximately at the Nash equilibrium, if $K$ is sufficiently large. ■

The intuition behind this scheme is that each user will sample each channel at least $\Theta(T)$ times, and hence as $T$ increases, the estimate $\hat\theta$ converges to $\theta$ in probability, implying that the unknown $\theta$ case will eventually reduce to the case in which $\theta$ is known to all the users. Hence, if $K$ is sufficiently large, the operating point converges to the Nash equilibrium in probability.

If one can assume that the users will follow the pre-specified rule, then we can design the following strategy that converges to the optimal operating point in probability for any $K$, as $T$ increases.

Rule 3: 1) Initialization: Each user $k$ maintains the following two vectors: $X_k$, which records the number of time slots in which user $k$ has sensed each channel to be free, and $Y_k$, which records the number of time slots in which user $k$ has sensed each channel. At the beginning of each block, user $k$ senses each channel once, and transmits through this channel if the channel is free and it wins the competition. Also, set $X_{k,i} = 1$, regardless of the sensing result at this stage.

2) At the beginning of time slot $j \le \ln T$, user $k$ estimates $\theta_i$ as

$$\hat\theta_i(j) = X_{k,i}(j)/Y_{k,i}(j),$$

and chooses each channel $i \in \mathcal{N}$ with probability $\hat\theta_i(j)/\sum_{l=1}^{N}\hat\theta_l(j)$. For $j > \ln T$, the $i$th channel is sensed with probability

$$\Big\{1-\Big(\frac{\lambda^*}{K\hat\theta_i(j)}\Big)^{1/(K-1)}\Big\}^+, \qquad (14)$$

where $\lambda^*$ is the constant for which these probabilities sum to one. After each sensing, $X_k$ and $Y_k$ are updated. □
Lemma 10: The proposed scheme converges in probability 
to the optimal operating point specified in Lemma 6, as T 
increases. 

Proof: Following the same steps as the proof of Lemma 9, one can show that after $O(\ln T)$ time slots, $\hat\theta$ converges to $\theta$ in probability as $T$ increases. Hence the operating point specified by (14) converges in probability to the optimal point specified in Lemma 6 as $T$ increases. ■

V. Multi-Channel Cognitive Users 

In certain scenarios, cognitive users may be able to sense more than one channel simultaneously. To simplify the presentation, we assume the presence of only a single cognitive user capable of sensing, and subsequently utilizing, $M < N$ channels simultaneously. Let $\mathcal{M}(j)$ be the set of channels the cognitive user selects to sense at time slot $j$, where $|\mathcal{M}(j)| = M$. The average number of bits that the cognitive user is able to send over a block is therefore

$$E\{W\} = E\Big\{\sum_{j=1}^{T}\sum_{s(j)\in\mathcal{M}(j)} B Z_{s(j)}(j)\Big\}.$$

At the beginning of time slot $j$, the cognitive user can update the pdf $f^j(\theta)$ according to (3) and (4). Similar to Lemma 1, the optimal solution can be characterized by the following optimality condition

$$V^*(f,T) = \max_{\mathcal{M}(1)\subseteq\mathcal{N},\,|\mathcal{M}(1)|=M} E\Big\{\sum_{s(1)\in\mathcal{M}(1)} B Z_{s(1)} + V^*\big(f_{\{z_{s(1)}:\,s(1)\in\mathcal{M}(1)\}},\,T-1\big)\Big\}. \qquad (15)$$

Here, $f_{\{z_{s(1)}:\,s(1)\in\mathcal{M}(1)\}}$ is the updated density after observing the sensing output of the channels $s(1) \in \mathcal{M}(1)$. We can then follow the same procedure described for the single-channel sensing scenario to obtain the optimal strategy $\Gamma^*$ according to (15). In the following, however, we focus on low complexity non-parametric strategies that are asymptotically optimal.

If $\theta$ is known, the cognitive user will choose the $M$ channels with the largest $\theta$'s to sense. Without loss of generality, we assume $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_N$. Hence, for any strategy $\Gamma$, the loss is

$$L(\theta;\Gamma) = \sum_{j=1}^{T}\sum_{i=1}^{M} B\theta_i - E\Big\{\sum_{j=1}^{T}\sum_{i=1}^{N} B\theta_i\, I\{i\in\mathcal{M}(j)\}\Big\}.$$

We have the following order-optimal simple single-index strategy.

Rule 4: The cognitive user maintains two vectors $X$ and $Y$, where each $X_i$ is the number of time slots in which the cognitive user has sensed channel $i$ to be free, and each $Y_i$ is the number of time slots in which the cognitive user has chosen channel $i$ to sense. The strategy works as follows.

1) Initialization: at the beginning of each block, each channel is sensed once. This initialization stage takes $\lceil N/M\rceil$ time slots, in which $\lceil x\rceil$ denotes the least positive integer that is no less than $x$.

2) After the initialization period, the cognitive user obtains an estimate $\hat\theta$ at the beginning of time slot $j$ given by

$$\hat\theta_i(j) = \frac{X_i(j)}{Y_i(j)},$$

and assigns an index

$$\Lambda_i(j) = \hat\theta_i(j) + \sqrt{\frac{2\ln j}{Y_i(j)}}$$

to the $i$th channel. The cognitive user orders these $\Lambda_i(j)$s and selects the $M$ channels with the largest $\Lambda_i(j)$s to sense. After each sensing, the cognitive user updates $X$ and $Y$.
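
Rule 4 differs from Rule 1 only in that the $M$ largest indices are selected in each slot. The following sketch illustrates this under stated assumptions: the index is taken to have the same form as in Rule 1 (empirical mean plus $\sqrt{2\ln j/Y_i(j)}$), channel availability is simulated as independent Bernoulli trials with hypothetical true probabilities `theta`, and the function name `multi_channel_index_strategy` is made up for this example.

```python
import math
import random


def multi_channel_index_strategy(theta, num_sensed, block_len, rng):
    """Simulate Rule 4 for one block and return (X, Y).

    theta      : true availability probabilities (unknown to the user)
    num_sensed : M, the number of channels sensed per slot
    block_len  : T, assumed to be at least ceil(N / M)
    """
    n = len(theta)
    free = [0] * n      # X_i
    sensed = [0] * n    # Y_i
    init_slots = math.ceil(n / num_sensed)
    for j in range(1, block_len + 1):
        if j <= init_slots:
            # Initialization: sweep the channels M at a time. The last
            # initialization slot may cover fewer than M channels.
            start = (j - 1) * num_sensed
            chosen = list(range(start, min(start + num_sensed, n)))
        else:
            index = [
                free[i] / sensed[i] + math.sqrt(2.0 * math.log(j) / sensed[i])
                for i in range(n)
            ]
            # Select the M channels with the largest indices.
            chosen = sorted(range(n), key=lambda i: index[i], reverse=True)[
                :num_sensed
            ]
        for i in chosen:
            sensed[i] += 1
            if rng.random() < theta[i]:
                free[i] += 1
    return free, sensed
```

In a run with `theta = [0.9, 0.8, 0.2, 0.1]` and `num_sensed = 2`, the two inferior channels should accumulate far fewer sensing slots than the two best ones.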

Lemma 11: Rule 4 is asymptotically optimal and $L(\theta;\Gamma) \sim O(\ln T)$.

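Proof: We bound $Y_i(T)$ for $i \ge M+1$, i.e., the channels that are not among the channels having the $M$ largest values of $\theta$. Note that $Y_i(T)$ is the total number of time slots in which the cognitive user has sensed channel $i$ in a block with $T$ time slots. We have

$$Y_i(T) = 1 + \sum_{j=\lceil N/M\rceil+1}^{T} I\{i\in\mathcal{M}(j)\} \le m + \sum_{j=\lceil N/M\rceil+m}^{T} I\{i\in\mathcal{M}(j)\,|\,Y_i(j)\ge m\},$$

for any $m \ge 1$, where $I\{x|y\}$ is the conditional indicator function, which equals 1 if, conditioning on $y$, $x$ is satisfied, and otherwise equals 0. Since $Y_i(j) \ge m$, it follows that $i \in \mathcal{M}(j)$ only if $\Lambda_i(j)$ is among the $M$ largest indices. Hence, a necessary condition for $i \in \mathcal{M}(j)$ is

$$\Lambda_i(j) \ge \min\{\Lambda_l(j) : 1 \le l \le M\}.$$

Otherwise, if

$$\Lambda_i(j) < \min\{\Lambda_l(j) : 1 \le l \le M\},$$

then the indices of these $M$ channels are already larger than that of channel $i$, and channel $i$ will not be selected. Thus

$$\begin{aligned}
I\{i\in\mathcal{M}(j)\,|\,Y_i(j)\ge m\} &\le I\{\Lambda_i(j) \ge \min\{\Lambda_l(j) : 1 \le l \le M\}\,|\,Y_i(j)\ge m\}\\
&\le \sum_{l=1}^{M} I\{\Lambda_i(j) \ge \Lambda_l(j)\,|\,Y_i(j)\ge m\}.
\end{aligned}$$

Hence

$$\begin{aligned}
Y_i(T) &\le m + \sum_{j=\lceil N/M\rceil+m}^{T}\sum_{l=1}^{M} I\{\Lambda_i(j) \ge \Lambda_l(j)\,|\,Y_i(j)\ge m\}\\
&\le \sum_{l=1}^{M}\Big\{m + \sum_{j=\lceil N/M\rceil+m}^{T} I\{\Lambda_i(j) \ge \Lambda_l(j)\,|\,Y_i(j)\ge m\}\Big\}.
\end{aligned}$$

In order for $\Lambda_i(j) \ge \Lambda_l(j)$, one of the following three conditions must be satisfied

$$\Lambda_l(j) \le \theta_l, \quad \Lambda_i(j) \ge \theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}}, \quad\text{or}\quad \theta_l < \theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}}.$$

One can easily check that, if none of these three conditions is satisfied, we will have $\Lambda_i(j) < \Lambda_l(j)$. In the following, we bound the probability of each event.

$$\begin{aligned}
\Pr\{\Lambda_l(j) \le \theta_l\,|\,Y_i(j)\ge m\} &\le \Pr\Big\{\big|\hat\theta_l(j)-\theta_l\big| \ge \sqrt{\frac{2\ln j}{Y_l(j)}}\,\Big|\,Y_i(j)\ge m\Big\}\\
&\le \sum_{q=1}^{j}\Pr\Big\{\big|\hat\theta_l(j)-\theta_l\big| \ge \sqrt{\frac{2\ln j}{q}}\,\Big|\,Y_i(j)\ge m,\ Y_l(j)=q\Big\}\\
&\le 2j\exp(-4\ln j) = 2j^{-3},
\end{aligned} \qquad (16)$$

where (16) follows from the following Chernoff-Hoeffding bound, which says that for $n$ i.i.d. Bernoulli random variables $X_j$, $j = 1,\cdots,n$, with mean $\theta$,

$$\Pr\Big\{\Big|\frac{\sum_{j=1}^{n}X_j}{n}-\theta\Big| \ge \epsilon\Big\} \le 2\exp(-2n\epsilon^2) \quad\text{for all } \epsilon > 0. \qquad (17)$$

To see this, we note that in our case, $X_l(j)$ is the sum of $Y_l(j)$ i.i.d. Bernoulli random variables with parameter $\theta_l$. On setting

$$n = q \quad\text{and}\quad \epsilon = \sqrt{\frac{2\ln j}{q}},$$

and also using the fact that

$$\hat\theta_l(j) = \frac{X_l(j)}{Y_l(j)},$$

we have (16).

Similarly, we have

$$\begin{aligned}
\Pr\Big\{\Lambda_i(j) \ge \theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}}\,\Big|\,Y_i(j)\ge m\Big\} &= \Pr\Big\{\hat\theta_i(j) \ge \theta_i + \sqrt{\frac{2\ln j}{Y_i(j)}}\,\Big|\,Y_i(j)\ge m\Big\}\\
&\le \sum_{q=1}^{j}\Pr\Big\{\hat\theta_i(j)-\theta_i \ge \sqrt{\frac{2\ln j}{q}}\,\Big|\,Y_i(j)=q\Big\}\\
&\le 2j\exp(-4\ln j) = 2j^{-3}.
\end{aligned}$$

At the same time, if we set

$$m = \Big\lceil\frac{8\ln T}{(\theta_M-\theta_i)^2}\Big\rceil,$$

we have for any $1 \le l \le M$, if $Y_i(j) \ge m$,

$$\theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}} \le \theta_i + (\theta_M-\theta_i)\sqrt{\frac{\ln j}{\ln T}} \le \theta_M \le \theta_l.$$

Hence with this $m$,

$$\Pr\Big\{\theta_l < \theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}}\,\Big|\,Y_i(j)\ge m\Big\} = 0,$$

for each $1 \le l \le M$.
Thus,

$$\begin{aligned}
\Pr\{\Lambda_i(j) \ge \Lambda_l(j)\,|\,Y_i(j)\ge m\} &\le \Pr\{\Lambda_l(j) \le \theta_l\,|\,Y_i(j)\ge m\}\\
&\quad + \Pr\Big\{\Lambda_i(j) \ge \theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}}\,\Big|\,Y_i(j)\ge m\Big\}\\
&\quad + \Pr\Big\{\theta_l < \theta_i + 2\sqrt{\frac{2\ln j}{Y_i(j)}}\,\Big|\,Y_i(j)\ge m\Big\}\\
&\le 4j^{-3},
\end{aligned}$$

and

$$\begin{aligned}
E\{Y_i(T)\} &\le \sum_{l=1}^{M} E\Big\{m + \sum_{j=\lceil N/M\rceil+m}^{T} I\{\Lambda_i(j) \ge \Lambda_l(j)\,|\,Y_i(j)\ge m\}\Big\}\\
&\le \sum_{l=1}^{M}\Big\{\Big\lceil\frac{8\ln T}{(\theta_i-\theta_M)^2}\Big\rceil + \sum_{j=\lceil N/M\rceil+m}^{T} 4j^{-3}\Big\}\\
&\le M\Big\lceil\frac{8\ln T}{(\theta_i-\theta_M)^2}\Big\rceil + 4M\sum_{j=\lceil N/M\rceil+m}^{T} j^{-3} \sim O(\ln T),
\end{aligned} \qquad (18)$$

since

$$\sum_{j=\lceil N/M\rceil+m}^{T} j^{-3} \le \sum_{j=1}^{\infty} j^{-3},$$

and $\sum_{j=1}^{\infty} j^{-3}$ exists.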

Hence from (18), we have that, for any channel that is not among the best $M$ channels, the average number of time slots for which this channel is selected is bounded by $O(\ln T)$. Thus, the loss is of order $O(\ln T)$.

On the other hand, it has been proved in [21] that for any consistent strategy,

$$\liminf_{T\to\infty}\frac{L(\theta;\Gamma)}{\ln T} \ge C_1,$$

with some constant $C_1$. This completes the proof. ■



VI. Conclusions 

This work has developed a unified framework for the design and analysis of cognitive medium access based on the classical bandit problem. In the single user scenario, our formulation highlights the tradeoff between exploration and exploitation in cognitive channel selection. A linear complexity cognitive medium access algorithm, which is asymptotically optimal as $T \to \infty$, has been proposed. The multi-user setting has also been formulated as a competitive bandit problem, enabling the design of efficient and game theoretically fair medium access protocols. Finally, these ideas have been extended to the multi-channel scenario in which the cognitive user is capable of utilizing several channels simultaneously.

Our results motivate several interesting directions for future research. For example, developing optimal medium access strategies that take sensing error into consideration is of practical significance. Applying other powerful tools from sequential analysis to design and analyze wireless networks is a promising research direction.